Attribute Relation Extraction from Template-inconsistent Semi-structured Text by Leveraging Site-level Knowledge

Authors

  • Yang Liu
  • Fang Liu
  • Siwei Lai
  • Kang Liu
  • Guangyou Zhou
  • Jun Zhao
Abstract

A variety of methods have been proposed for attribute-value extraction from semi-structured text with consistent templates (strict semi-text). However, when the templates in semi-structured text are inconsistent (weak semi-text), these methods perform poorly. To overcome this template-inconsistency problem, we propose a novel method that leverages site-level knowledge for attribute-value extraction. First, we use a graph-based random walk model to acquire site-level knowledge. Then we use this knowledge to identify weak semi-text in each page and extract attribute-value pairs. The experiments show that, compared to a baseline method that does not use site-level knowledge, our method improves extraction performance significantly.
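The abstract does not say how the graph is built or how the walk is run, so the following is only an illustrative sketch of one way a graph-based random walk could surface site-level knowledge, not the authors' model. It assumes a bipartite graph linking pages of one site to the candidate labels found on them, and runs a random walk with restart from a few seed labels that are assumed to be known attribute names. The function name `random_walk_scores`, the `restart` and `iterations` parameters, and the toy page data are all hypothetical.

```python
from collections import defaultdict


def random_walk_scores(page_labels, seeds, restart=0.15, iterations=50):
    """Score candidate labels with a random walk with restart.

    page_labels: dict mapping a page id to the set of candidate labels on it.
    seeds: labels assumed to be known attribute names (hypothetical input).
    """
    # Build an undirected bipartite graph: page nodes <-> label nodes.
    neighbors = defaultdict(set)
    for page, labels in page_labels.items():
        for label in labels:
            neighbors[("page", page)].add(("label", label))
            neighbors[("label", label)].add(("page", page))

    seed_nodes = [("label", s) for s in seeds if ("label", s) in neighbors]
    if not seed_nodes:
        raise ValueError("no seed label occurs in the graph")
    restart_mass = {node: 1.0 / len(seed_nodes) for node in seed_nodes}

    # Start the walk at the seeds and repeat the update step.
    scores = dict(restart_mass)
    for _ in range(iterations):
        updated = defaultdict(float)
        for node, mass in scores.items():
            # Spread the non-restarting mass evenly over the neighbors.
            share = (1.0 - restart) * mass / len(neighbors[node])
            for other in neighbors[node]:
                updated[other] += share
        # Return the restarting mass to the seed labels.
        for node, mass in restart_mass.items():
            updated[node] += restart * mass
        scores = updated

    # Report label nodes only, highest score first.
    labels = {n[1]: s for n, s in scores.items() if n[0] == "label"}
    return sorted(labels.items(), key=lambda item: item[1], reverse=True)


# Toy site: labels that co-occur with the seed across pages score higher.
pages = {
    "p1": {"Price", "Brand", "Share"},
    "p2": {"Price", "Brand", "Weight"},
    "p3": {"Price", "Weight"},
    "p4": {"Share", "Login"},
}
ranking = random_walk_scores(pages, seeds=["Price"])
```

In this toy setup, labels that share many pages with the seed accumulate more walk mass than navigation-like labels that appear elsewhere, which is the kind of site-wide signal that could help when an individual page's template is inconsistent.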

Similar resources

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) that has been intensified by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...

Agricultural Knowledge Discovery from Semi-Structured Text

This research aims to develop an automatic knowledge discovery system for semi-structured Thai text to support plant diagnosis. Plant disease diagnosis is very important for farmers to be able to cure infected plants before infections become more severe. Prior to diagnosis, farmers need to gain knowledge retrieved primarily from text, including unstructured and semi-structured documents. As th...

High-Precision Web Extraction Using Site Knowledge

In this paper, we study the problem of extracting structured records from semi-structured Web pages. Existing Web information extraction techniques like wrapper induction require a large amount of editorial effort for annotating pages. Other schemes based on Conditional Random Fields (CRFs) suffer from precision loss due to variable site structures and abundance of noise in Web pages. In this p...

Knowledge Base Augmentation using Tabular Data

Large linked data repositories have been built by leveraging semi-structured data in Wikipedia (e.g., DBpedia) and through extracting information from natural language text (e.g., YAGO). However, the Web contains many other vast sources of linked data, such as structured HTML tables and spreadsheets. Often, the semantics in such tables is hidden, preventing one from extracting triples from them...

A Fuzzy Approach for Pertinent Information Extraction from Web Resources

Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages. For suitable regular domains, existing wrapper induction algorithms can efficientl...

Publication date: 2013